Data compression using correlations and stochastic processes in the
ALICE Time Projection Chamber
In this paper, a lossless and a quasi-lossless algorithm for the online compression of the data generated by the
Time Projection Chamber (TPC) detector of the ALICE experiment at CERN are described.
The first algorithm is based on a lossless source code modelling technique, i.e. the original TPC signal information
can be reconstructed without errors at the decompression stage. The source model exploits the temporal
correlation that is present in the TPC data to reduce the entropy of the source.

The second algorithm is based on a lossy source code modelling technique, i.e. it is lossy if samples of the TPC 
signal are considered one by one. Nevertheless, the source model is quasi-lossless from the point of view of some 
physical quantities that are of main interest for the experiment. These quantities are the shape, the location of 
the center of gravity as well as the total charge of the signal. 

In order to evaluate the consequences of the error introduced by the lossy compression, the results of the 
trajectory tracking algorithms that process data offline are analyzed, in particular, with respect to the noise 
introduced by the compression. The offline analysis has two steps: cluster finder and track finder. The results 
on how these algorithms are affected by the lossy compression are reported. 

In both compression techniques, entropy coding is applied to the set of events defined by the source model to
reduce the bit rate to the corresponding source entropy. Using simulated TPC data, the lossless compression
reduces the data to 49.2% of the original data rate, and the lossy compression to between 35% and 30%,
depending on the desired precision.

In this study we have focused on methods which are easy to implement in the frontend TPC electronics. 



1. Introduction 

ALICE (A Large Ion Collider Experiment) is an experiment that will start in 2007 at the LHC (Large Hadron Collider) at CERN [1, 2]. The experiment will study collisions between heavy ions with energies around 5.5 TeV per nucleon. The collisions will take place at the center of a set of several detectors, which are designed to track and identify the produced particles.

One of the main detectors of the ALICE experiment 
is the Time Projection Chamber (TPC). Its task is 
track finding, momentum measurement and particle 
identification by dE/dx. Good two-track resolution, 
required for correlation studies, is one of the main 
design goals. 

The TPC is a large horizontal cylinder, filled with 
gas, where a suitable axial electric field is present. 
When particles pass through, they ionize the gas 
atoms, and the resulting electrons drift in the electric 
field. By measuring the arrival of electrons at the end 
of the chamber, the TPC can reconstruct the path 
of the original charged particles. The electrons are collected by more than 570 000 sensitive pads, where they create signals. These signals are amplified by a preamplifier-shaper and digitized by a 10-bit A/D converter at a sampling frequency of 5.66 MHz. The digitized signal is processed and formatted by an Application Specific Integrated Circuit (ASIC) called ALTRO (ALICE TPC Read-Out) [3]. At this stage, the overall throughput of the 570 000 channels is around 8.4 GByte/s.

The total amount of TPC data is expected to be about 1 PByte per year. In order to keep the complexity and cost of the data storage equipment as low as possible, the volume of data has to be reduced using suitable data compression methods. The cost reduction of the data storage system is roughly proportional to the data compression factor. Furthermore, it is preferable to implement the compression system in the front-end electronics at the output of the ALTRO circuit, so that the cost of the optical links, which carry data out of the chamber to the following stages of the acquisition chain, can also be reduced. More sophisticated methods for TPC data compression based on online tracking, which will be used further along the data acquisition chain, are being developed in Bergen and Heidelberg [4].

The use of a lossy source model, justified by the fact 
that generally it can provide significantly higher com- 
pression ratios compared to lossless models, has the 
drawback that some deterioration in the reconstruc- 
tion of data must be accepted. Lossy source models 
have become very popular in the last decade in the 
field of audio and video compression for their remark- 
able performance. Lossy models have been carefully 
designed so that reconstruction distortions are not 
perceived using psychovisual or psychoacoustic mod- 
els or they remain comparable with the intrinsic signal 
noise. 
Figure 1: Schematic view of the detection process in the TPC (upper part: perspective view, lower part: side view).



Obviously, for physical data, psychovisual or psychoacoustic tests are meaningless or not even applicable, since the TPC signal is not to be observed by the human eye or ear. In earlier work, the compression noise introduced on the sample values by the described lossy or quasi-lossless techniques has been evaluated in terms of the RMS of the introduced error (RMSE).

However, in this case, the RMSE, despite being a 
simple and well known distortion measure, is not very 
useful. The fundamental information that has to be extracted from TPC data is not the sample values themselves but the physical quantities that enable the reconstruction of particle trajectories. Therefore, the
correct way to evaluate the importance of the dis- 
tortion introduced by the compression-decompression 
process has to be related to the high level information 
that is carried by the data. In particular, TPC data 
are collected with the objective of measuring particle 
energy and trajectory. 



Therefore, the most effective way to estimate the 
consequences of the compression distortion error, is 
to observe how the extraction of energy and trajec- 
tories are affected by the compression-decompression 
process. A simple way to obtain these estimates 
is to apply the cluster finding and tracking algo- 
rithms on both simulated data and their compressed- 
decompressed version and compare the results. 

This article is arranged in the following way. In section 2 all stochastic processes relevant for particle detection in the ALICE TPC are briefly described. In section 3 the TPC data format is specified. In section 4 different lossless compression techniques are described and their efficiencies are compared. In section 5 a fast one-dimensional lossy compression technique is presented, and the impact of compression and decompression on the distortion of the most important physical quantities is demonstrated.
2. Stochastic processes in TPC

2.1. Ionization in gas

A charged particle that traverses the gas of the chamber leaves a track of ionization along its trajectory. The collisions with the gas atoms are purely random. They are characterized by a mean free path $\lambda$ between ionizing encounters, which is given by the ionization cross-section per electron $\sigma_{ion}$ and the density $N$ of electrons:

$$\lambda = 1/(N \sigma_{ion}).$$

Therefore, the number of encounters along the length $L$ has the mean $L/\lambda$, and its frequency distribution is the Poisson distribution

$$P(L/\lambda, k) = \frac{(L/\lambda)^k}{k!} \exp(-L/\lambda).$$

The mean free path $\lambda$ is given by the properties of the gas and by the charged particle characteristics:

$$\lambda = \frac{1}{N_{prim} \times f(\beta\gamma)},$$

where $N_{prim}$ is the number of primary electrons per cm produced by a Minimum Ionizing Particle (MIP), and $f(\beta\gamma)$ is the Bethe-Bloch curve.
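As an illustration of the Poisson statistics described above, the following Python sketch places primary collisions along a straight track segment by drawing exponentially distributed free paths with mean $\lambda = 1/(N_{prim} f(\beta\gamma))$. The function name, the default $f(\beta\gamma) = 1$ (a minimum ionizing particle) and the sample size are assumptions made for this example; the value of 14.35 primary electrons per cm is the one quoted in section 5.1.

```python
import random

def primary_collisions(length_cm, n_prim_per_cm=14.35, bethe_bloch=1.0, rng=random):
    """Return positions (cm) of primary collisions along a straight track segment."""
    # Mean free path lambda = 1 / (N_prim * f(beta*gamma)).
    mean_free_path = 1.0 / (n_prim_per_cm * bethe_bloch)
    positions = []
    x = rng.expovariate(1.0 / mean_free_path)
    while x < length_cm:
        positions.append(x)
        # Distances between successive collisions are exponentially distributed.
        x += rng.expovariate(1.0 / mean_free_path)
    return positions

rng = random.Random(1)  # fixed seed, only to make the illustration repeatable
counts = [len(primary_collisions(1.0, rng=rng)) for _ in range(20000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# For a Poisson distribution both numbers should be close to L / lambda = 14.35.
print(f"mean = {mean:.2f}, variance = {var:.2f}")
```

The mean and the variance of the collision count agree within statistical error, which is the signature of the Poisson distribution above.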



2.2. Generation of secondary electrons

The energy loss $E_{tot}$ released in primary ionization to atomic electrons is a random variable. It can be described by the Photo-Absorption Ionization (PAI) model. In most cases, if one neglects the atomic shell structure, at sufficiently high $E_{tot}$ (the energy above which the atomic shell structure is no longer important) it obeys a $1/E_{tot}^2$ rule.

If the electron produced by the charged particle has sufficient kinetic energy $E_{tot}$, it will produce secondary electrons, thus creating an electron cluster. The mean total number of electrons in such a cluster is given by:

$$N_{tot} = \frac{E_{tot} - I_{pot}}{W_{ion}} + 1,$$

where $E_{tot}$ is the energy loss in a primary collision, $W_{ion}$ is the effective energy required to produce an electron-ion pair and $I_{pot}$ is the first ionization potential. The random character of the secondary ionization process smears out structures in the $E_{tot}$ spectra, so that the atomic shell structure is suppressed. For example, in the gas mixture 90% Ne, 10% CO$_2$ the effective parametrization $E_{tot}^{-2.2}$ can be used at lower $E_{tot}$.

2.3. Diffusion of electrons

Produced electrons drift through the gas with an effective constant drift velocity in the direction given by the electric field E and magnetic field B (which we assume are parallel to the z-direction). Drifting electrons are scattered on the gas molecules so that their direction of motion is randomized in each collision. The position of the electron, after drifting over a distance $L_{drift}$, can be described by a 3-D Gaussian distribution:

$$P(x,y,z) = \frac{1}{(2\pi)^{3/2} \sigma_T^2 \sigma_L} \exp\left[-\frac{(x-x_0)^2 + (y-y_0)^2}{2\sigma_T^2} - \frac{(z-z_0)^2}{2\sigma_L^2}\right],$$

where $(x_0, y_0, z_0)$ is the electron creation point, and the transverse diffusion $\sigma_T$ and the longitudinal diffusion $\sigma_L$ are given by the drift length $L_{drift}$ and the gas coefficients $D_T$ and $D_L$:

$$\sigma_T = D_T \sqrt{L_{drift}}, \qquad \sigma_L = D_L \sqrt{L_{drift}}.$$

2.4. ExB and unisochronity effect near the anode wires

It has been assumed that the electric and magnetic fields in the drift volume are uniform and parallel. This, however, is not true close to the anode wires, where the electric field becomes radial. Thus the electrons experience a shift along the wire direction (due to the Lorentz force). If an electron enters the readout chamber at the point $(x_c, y_c)$, it is displaced in the x-direction (assuming that the wires are placed along the y-axis). The new y-position of the electron is then given by

$$y = y_c + \omega\tau \cdot (x - x_c),$$

where $x$ is the coordinate of the wire on which the electron is collected, and $\omega\tau$ is the tangent of the Lorentz angle (ExB effect). The drift length which determines the z coordinate will also be affected, because of the change in the path to the anode wire (unisochronity effect).
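The diffusion and ExB formulas of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used for the paper: the function names are hypothetical, and the diffusion coefficient and Lorentz-angle tangent used in the demonstration are placeholder values chosen only to exercise the code.

```python
import math
import random

def diffuse(x0, y0, z0, l_drift, d_t, d_l, rng=random):
    """Smear the creation point with sigma_T = D_T*sqrt(L_drift), sigma_L = D_L*sqrt(L_drift)."""
    sigma_t = d_t * math.sqrt(l_drift)
    sigma_l = d_l * math.sqrt(l_drift)
    return (rng.gauss(x0, sigma_t), rng.gauss(y0, sigma_t), rng.gauss(z0, sigma_l))

def exb_shift(x_c, y_c, x_wire, omega_tau):
    """New y-position after the ExB displacement: y = y_c + omega*tau * (x_wire - x_c)."""
    return y_c + omega_tau * (x_wire - x_c)

rng = random.Random(5)  # fixed seed, only to make the illustration repeatable
# Placeholder numbers: 100 cm drift, D_T = D_L = 0.02 cm/sqrt(cm), so sigma = 0.2 cm.
points = [diffuse(0.0, 0.0, 0.0, 100.0, 0.02, 0.02, rng) for _ in range(20000)]
rms_x = math.sqrt(sum(p[0] ** 2 for p in points) / len(points))
print(f"rms of x after drift = {rms_x:.3f} cm")
print(f"y after ExB shift = {exb_shift(0.0, 1.0, 0.25, 0.4):.3f}")
```

Because the smearing grows only with the square root of the drift length, quadrupling the drift length doubles the width of the electron cloud.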

2.5. Signal generation 

Inside the readout chamber, as an electron drifts 
towards the anode wire, it travels in an increasing 
electric field. Once the electric field is strong enough 
that between collisions with the gas molecules the 
electron can pick up sufficient energy for ionization, 
another electron is created and the avalanche starts. As the number of electrons multiplies in successive generations, the avalanche continues to grow until all the electrons are collected on the wire. The resulting number of electrons created in the avalanche can be described by an exponential probability distribution

$$P(q) = \frac{1}{\bar{q}} \exp\left(-\frac{q}{\bar{q}}\right),$$

where $\bar{q}$ is the average avalanche amplitude.

An electron avalanche collected on the anode wire induces a charge on the pad plane. This charge is integrated over the pad area. The time signal is obtained by folding the pad response to the avalanche with the shaping function of the preamplifier-shaper. This signal is then sampled with a constant frequency. Random electronic noise is superimposed on the sampled signal.

As a result a charged particle interacting with gas 
generates a cluster of amplitudes. This cluster is used 
for later estimation of local track position and of local 
energy deposition. The shape of the cluster is used 
as additional information for the estimation of posi- 
tion uncertainties and for the estimation of the overlap 
factor between two tracks. 
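A toy version of the signal generation chain may help to fix ideas. The sketch below draws an exponentially distributed avalanche amplitude for each electron, adds Gaussian electronic noise and clips the result to a 10-bit ADC range. It deliberately omits the pad response and the shaping function, and all names and numerical values (mean gain, noise width, number of electrons) are assumptions made for the example.

```python
import random

def pad_signal(n_electrons, mean_gain, noise_sigma, rng=random):
    """Charge seen in one pad/time-bin: exponential avalanche per electron plus noise."""
    avalanche = sum(rng.expovariate(1.0 / mean_gain) for _ in range(n_electrons))
    return avalanche + rng.gauss(0.0, noise_sigma)

def digitize(signal, n_bits=10):
    """Round to the nearest ADC count and clip to the range of an n-bit converter."""
    return max(0, min((1 << n_bits) - 1, round(signal)))

rng = random.Random(3)  # fixed seed, only to make the illustration repeatable
samples = [digitize(pad_signal(20, 5.0, 1.0, rng)) for _ in range(5000)]
mean_adc = sum(samples) / len(samples)
# The mean should be close to n_electrons * mean_gain = 100 ADC counts.
print(f"mean amplitude = {mean_adc:.1f} ADC counts")
```

Since the variance of an exponential distribution equals the square of its mean, gas gain fluctuations alone contribute a relative spread of $1/\sqrt{n}$ for $n$ electrons, on top of the fluctuation of $n$ itself.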



tan 2 a ip^Gxf actor (Nprim) 



12N, 



chprim 



(2) 



and <7 y of cluster center in y(pad) direction: 

2 _ -Pt Adrift r , 

tan 2 /3L 2 ad G Lf 

actor (^prim) 2 



chprim 



(3) 



where iV c h is the total number of electrons in cluster, 
A^chprim ^ s ^ ne num ber of primary electrons in cluster, 
G g is the gas gain fluctuation factor parametrization, 
Gi,factor is the secondary ionization fluctuation factor 



and a -a 



describe the contribution of the electronic 



noise and ADC quantization to the resulting sigma of 
the COG. 

The typical resolution in the case of ALICE TPC is 
on the level of cr y ~ 0.8mm and er z ~ 1.0mm integrat- 
ing over all clusters in the TPC. 



2.7. Accuracy of the total amplitude 
measurement 



2.6. Accuracy of local coordinate 
measurement 

The accuracy of the coordinate measurement is lim- 
ited by a track angle which spreads ionization and by 
diffusion which amplifies this spread. 

The track direction with respect to pad plane is 
given by two angles a and (see fig. QJ. For the 
measurement along the pad-row, the angle a between 
the track projected onto the pad plane and pad-row is 
relevant. For the measurement of the the drift coordi- 
nate (z-direction) it is the angle (3 between the track 
and z axis. 

The ionization electrons are randomly distributed 
along the particle trajectory. Fixing the reference x 
position of a electron at the middle of pad-row, the y 
(resp. z) position of the electron is random variable 
characterized by uniform distribution with the width 
L a , where L a is given by the pad length L pa( j and the 
angle a (resp. /?): 



U = L 



pad 



tana 



The diffusion smears out the position of the electron 
with gaussian probability distribution with ctd- Con- 
tribution of the ExB and unisochronity effect is in 
the case of Alice TPC negligible. 

The accuracy of the position measurement can be 
expressed as: 

ct z of cluster center in z (time) direction: 



A ch 



(S 



The total charge deposited in the clusters can be 
used for particle identification. The important value, 
which is specific for different particle types and dif- 
ferent particle momenta, is the number of primary 
collision per unit length, A c h pr im- A c h pr im is a ran- 
dom variable described by Poisson distribution. Due 
to the secondary ionization and gas gain fluctuations 
the total charge is described by very broad Landau 
distribution. 



3. The ALICE TPC read-out data format 



Before describing the compression algorithm, a few words are needed on the format of the data at the output of the ALTRO circuit, in order to understand how the compression algorithms are applied. These data are the input of the compression system.

In the ALTRO data format only the samples over a given threshold are considered, while the others are discarded. This means that, if we call a group of adjacent over-threshold samples coming from one pad a bunch, the signal can be represented "bunch by bunch". More precisely, a bunch is described by three fields: temporal information (the temporal position of the last sample in the bunch, one 10-bit word), bunch length (the number of samples in the bunch, one 10-bit word), and sample amplitude values (a few 10-bit words).
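The bunch representation described above can be sketched as follows. This is an illustrative reading of the format, not the ALTRO implementation: the function names are hypothetical, "over threshold" is taken to mean strictly greater than the threshold, and the word count simply adds the two header words to the number of samples.

```python
from typing import List, Tuple

# (time bin of the last sample, bunch length, sample amplitudes)
Bunch = Tuple[int, int, List[int]]

def find_bunches(samples: List[int], threshold: int) -> List[Bunch]:
    """Group adjacent over-threshold samples of one pad into bunches."""
    bunches: List[Bunch] = []
    current: List[int] = []
    for time_bin, adc in enumerate(samples):
        if adc > threshold:
            current.append(adc)
        elif current:
            # The bunch ended on the previous time bin.
            bunches.append((time_bin - 1, len(current), current))
            current = []
    if current:
        bunches.append((len(samples) - 1, len(current), current))
    return bunches

def ten_bit_words(bunch: Bunch) -> int:
    """One word for the time, one for the length, one per sample."""
    return 2 + bunch[1]

pad = [0, 0, 5, 9, 4, 0, 0, 7, 0]
bunches = find_bunches(pad, threshold=2)
print(bunches)
print(sum(ten_bit_words(b) for b in bunches), "10-bit words")
```

In this toy pad, nine raw samples shrink to eight 10-bit words; the gain from zero suppression grows with the fraction of empty time bins.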
4. Lossless compression of TPC signals

The lossless data compression techniques are based on the fact that TPC sample values (ADC and temporal) are not equally probable. A theoretical lower limit on the average word size using Huffman coding or arithmetic coding is given by the entropy of the data source:

$$E(p) = -\sum_{A} p(A) \log_2 p(A) \qquad (4)$$

The lossless techniques described in this paper are based mainly on an appropriate probability model for each data field of the ALTRO data format. Specific probability models for each sample in a bunch were developed. These models are intended to capture both the temporal correlation among samples and the characteristic shape of TPC electrical pulses.
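Formula (4) can be evaluated directly from observed symbol frequencies. The short Python sketch below estimates the entropy in bits per symbol from a list of values; the function name and the toy inputs are assumptions made for the example.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical entropy -sum p(A) log2 p(A) of a sequence of symbols, in bits."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(entropy_bits([0, 1, 2, 3]))      # four equally probable values: 2 bits
print(entropy_bits([0, 0, 0, 1]))      # skewed distribution: less than 1 bit
print(entropy_bits([7] * 10))          # a single value carries no information
```

The more unequal the probabilities, the lower the entropy, which is why a source model that sharpens the probability distributions improves the achievable compression.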

4.1. Time information 

As already mentioned, in the ALTRO data format the time information is represented as the 10-bit number of the time-bin of the last sample of the bunch. The probability distribution of this variable is roughly uniform. In order to achieve a better compression ratio, this variable is replaced by the distance between two consecutive bunches. This variable follows an exponential distribution with much lower entropy. The entropy of the temporal information is determined by the mean distance between two bunches. It depends on the event multiplicity, the noise level and the local occupancy, which is a known function of the pad-row radius. In order to optimize the entropy coding, it will be necessary to investigate the probability distribution as a function of track multiplicity. This information will be known from other, faster ALICE detectors.

The mean number of bits used for the coding of time information is roughly 4.9 bits for a full event with maximal track density. Using different codes in different places inside the TPC, an additional 6% reduction in time information can be achieved.
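The benefit of coding distances instead of absolute time bins can be illustrated with a toy simulation. The sketch below is built on assumptions: the distance is taken as the difference between the last-sample time bins of consecutive bunches on one pad (with the first bunch measured from time bin 0), and the simulated occupancy (mean gap of about 30 time bins, 2000 pads) is arbitrary, so the printed entropies are not the values reported in this paper.

```python
import math
import random
from collections import Counter

def to_distances(last_time_bins):
    """Replace absolute time bins by distances to the previous bunch on the pad."""
    distances, previous = [], 0
    for t in last_time_bins:
        distances.append(t - previous)
        previous = t
    return distances

def from_distances(distances):
    """Inverse transform: the coding step is lossless."""
    times, t = [], 0
    for d in distances:
        t += d
        times.append(t)
    return times

def entropy_bits(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(7)  # fixed seed, only to make the illustration repeatable
absolute, relative = [], []
for _pad in range(2000):
    times, t = [], 0
    while True:
        t += 1 + int(rng.expovariate(1.0 / 30.0))  # toy gap between bunches
        if t > 1023:
            break
        times.append(t)
    absolute += times
    relative += to_distances(times)

print(f"absolute time bins: {entropy_bits(absolute):.2f} bits")
print(f"distances:          {entropy_bits(relative):.2f} bits")
```

The absolute time bins are almost uniformly distributed over the 10-bit range, while the distances concentrate on small values, so their entropy is several bits lower.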

4.2. Bunch length 

In the ALTRO data format, the bunch length is represented as a 10-bit number of samples in the bunch. The bunch length depends on the diffusion, the angular effect and the total deposited energy. Small diffusion for short drift length is compensated by a large angular effect, and the total deposited energy is known only after the bunch length has been coded. Since no apparent correlation with previously coded data (e.g. the length of adjacent bunches) exists and no better model (i.e. a model of events with lower entropy) could be found, this information is coded directly.

Bunch length   Freq.   Sample position in the bunch
                       1      2      3      4      5
1              136     2.21
2              279     4.04   4.04
3              422     4.64   5.5    4.64
4              241     4.18   6.67   6.02   4.18
5              53      3.83   6.1    7.15   6.1    3.83

Table I Entropy of the sample data as a function of the sample position in the bunch. The frequency of each bunch length is given in arbitrary units.



4.3. Sample values coding 

Sample values are the main contribution to the resulting data volume. This subsection first describes a basic model and then introduces a more sophisticated one that provides higher compression efficiency.

Data compression can be obtained by directly applying entropy coding to the sample values without any modelling of the information source. This method will be referred to below (in table II) as Entropy Coding (EC).



4.4. Coding model based on the sample 
position 

Improvements in compression performance can be obtained by appropriate modelling. A first improvement is based on the fact that the statistics of the signal sample values depend on the position of the sample in the bunch.

Due to the pseudo-Gaussian shape of most of the bunches, the first and the last sample of each bunch are likely to have a smaller value than those in central positions. Similarly, small values are also expected for isolated samples, i.e. those belonging to one-sample bunches (see table I).

Therefore, a classification of the samples into three classes was chosen: one class for isolated samples, one for samples at the beginning and at the end of multiple-sample bunches, and the last for samples in the central positions of a bunch. Using three different probability distributions for entropy coding, the sample values can be coded more efficiently than with only one probability distribution. This coding scheme will be referred to in table II as coding using Sample Position (SP).
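The three-class scheme can be sketched as follows. The code classifies each sample by its position, builds one frequency table per class and compares the resulting average entropy with that of a single table. The class names, function names and toy bunches are assumptions made for the example; real code tables would be derived from simulated TPC data.

```python
import math
from collections import Counter, defaultdict

def sample_class(position, length):
    """Class of a sample: isolated, edge (first or last) or central."""
    if length == 1:
        return "isolated"
    if position == 0 or position == length - 1:
        return "edge"
    return "central"

def entropy_bits(counts):
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def bits_per_sample(bunches, use_position):
    """Average entropy per sample with one table per class (SP) or one table (EC)."""
    tables = defaultdict(Counter)
    for bunch in bunches:
        for i, value in enumerate(bunch):
            key = sample_class(i, len(bunch)) if use_position else "all"
            tables[key][value] += 1
    total = sum(sum(t.values()) for t in tables.values())
    return sum(sum(t.values()) / total * entropy_bits(t) for t in tables.values())

toy_bunches = [[3], [4, 9, 5], [3, 12, 14, 4], [4], [5, 11, 4], [3, 9, 12, 5]]
ec_bits = bits_per_sample(toy_bunches, use_position=False)
sp_bits = bits_per_sample(toy_bunches, use_position=True)
print(f"single table: {ec_bits:.2f} bits, three tables: {sp_bits:.2f} bits")
```

Splitting the statistics by class can never increase the empirical entropy; the price is the memory needed for the additional code tables.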
         Time       Length     Samples     Total
ALTRO    10 bits    10 bits    38.1 bits   58.1 bits (100%)
EC       4.9 bits   3.1 bits   22.4 bits   30.3 bits (52.5%)
SP       4.9 bits   3.1 bits   21.3 bits   29.2 bits (50.3%)
TC       4.9 bits   3.1 bits   20.7 bits   28.6 bits (49.2%)

Table II Performance of several lossless techniques compared to the zero-suppressed ALTRO data format. ALTRO: original ALTRO data; EC: entropy coding of sample values, bunch length, and time information; SP: classification of samples according to their position (3 code tables used); TC: coding technique that exploits temporal correlation (20 code tables used). Numbers in the columns represent the number of bits per bunch dedicated to each field; numbers in the right column represent the overall number of bits per bunch and, in parentheses, the size with respect to the original ALTRO data format.



4.5. Source models exploiting temporal 
correlation 

An improvement in compression performance can be expected from exploiting temporal correlation, i.e. the correlation between consecutive samples; this can be done by implementing a suitable prediction scheme.

This approach is explained using an example in which a three-sample bunch is considered. Let us assume that the first two samples have already been coded and that the third one has to be coded. The code to be used for sample No. 3 may be chosen among eight possible codes according to the value of sample No. 2. In particular, this is done by subdividing the range of sample No. 2 (i.e. 0...1023) into different intervals, and associating a different code (for the third sample) with each of these intervals.

This conditional probability model can be extended to all the samples that are not in the first position in the bunch and to any bunch length. However, if the real-time implementation constraints are taken into account, in particular the need to reduce the memory size of the model, an exceedingly large number of codes is undesirable. Consequently, samples are partitioned into four classes only, to keep the complexity of the model low. This limitation does reduce the efficiency of the model, but the reduction is only of the order of 0.6%. This coding scheme will be referred to in table II as coding using Temporal Correlation (TC).
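The interval-based selection of the code can be sketched in Python. The previous sample is mapped to an interval index, and a separate frequency table is kept for each interval; the conditional entropy then estimates the achievable bits per sample. The interval boundaries (8, 32, 128), the function names and the toy bunches are hypothetical choices for illustration and are not the partition used in this study.

```python
import bisect
import math
from collections import Counter, defaultdict

def context_of(previous_sample, boundaries):
    """Index of the interval of 0...1023 that contains the previous sample."""
    return bisect.bisect_right(boundaries, previous_sample)

def conditional_entropy_bits(bunches, boundaries):
    """Average bits per non-first sample when the code depends on the context."""
    tables = defaultdict(Counter)
    for bunch in bunches:
        for previous, current in zip(bunch, bunch[1:]):
            tables[context_of(previous, boundaries)][current] += 1
    total = sum(sum(t.values()) for t in tables.values())
    bits = 0.0
    for table in tables.values():
        n = sum(table.values())
        bits -= sum(c * math.log2(c / n) for c in table.values())
    return bits / total

toy_bunches = [[4, 9, 5], [3, 40, 150, 60, 4], [5, 11, 4], [6, 45, 140, 35, 5], [4, 10, 5]]
one_table = conditional_entropy_bits(toy_bunches, boundaries=[])
four_tables = conditional_entropy_bits(toy_bunches, boundaries=[8, 32, 128])
print(f"one code: {one_table:.2f} bits, four codes: {four_tables:.2f} bits")
```

A decoder can follow the same rule, because the previous sample is already known when the next one is decoded; no side information about the chosen code has to be transmitted.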

4.6. Comparison of different lossless techniques

The results of the different lossless approaches on simulated TPC data are shown in table II. It may be noticed that the TC technique provides a compression of the data down to 49.2% of the original size. Even this best technique improves on the direct EC technique by only about 3% of the original data size.

THLT002

An additional attempt used the predicted mean cluster shape. Knowing the position of the bunch, the diffusion given by the drift length ($L_{drift}$) and the inclination angle for primary particles are known. However, due to the fluctuation of the cluster shape and the large number of secondary particles with unknown angles, this prediction is not very good, and the entropy of the samples is reduced only by an additional 2%.

4.7. Space correlation 

In the attempt to exploit spatial correlation, three lossless models have been considered. The first is based on spatially conditioned probability, the second on a predictive model, and the third on a two-dimensional cluster finder with residual saving.

The first one is the equivalent, in the spatial do- 
main, to what has been done for time correlation. Dif- 
ferent codes are available to code the samples; for each 
sample, the appropriate code is selected according to 
the value of the samples in the same time-bin but 
in adjacent pads. This method provides poorer per- 
formance when compared with the one which exploits 
time correlation (the comparison being done using the 
same model complexity, i.e. number of probability dis- 
tributions available in memory). Moreover, these two 
techniques cannot be easily combined, i.e. it is diffi- 
cult to exploit both temporal and spatial correlations 
at the same time, because this would require a very 
large number of probability distributions (i.e. code 
tables). 

The second method that has been investigated uses 
the prediction of the sample values from the samples in 
adjacent pads and coding the error of this prediction. 
Unfortunately, also for this model, the performance is 
not very good. 

Pulses in one pad-row often resemble temporally shifted versions of those in the adjacent pad-row. The two methods described above have been modified by adding a first stage which shifts pulses so as to increase the spatial correlation with the adjacent pad-row. Although the performance improved slightly, the increase in compression efficiency was lower than expected. The correlations are relatively small. The main problems are the large number of secondary particles crossing the TPC with unknown $\beta$ angle (not pointing to the primary vertex), the large spread of the particle momenta (unknown $\alpha$ angle) and the Landau fluctuation of the deposited energy on different pad-rows, which is almost uncorrelated. Moreover, the position of the original track relative to the pad affects the correlation by a large factor. The signal amplitudes in adjacent pads and adjacent pad-rows are very weakly correlated, unless the position and direction of the track are known.

In order to get better knowledge of the track position, two-dimensional cluster finding can be performed first. The entropy of the stored residuals is 30% lower than the entropy of the original samples, but there are problems with track overlaps and with the description of the cluster topology (i.e. where to store the residuals).

Based on these results we conclude that it is not 
simple to exploit spatial correlation (i.e. correlations 
between adjacent channels). There might be more so- 
phisticated and complex lossless models able to exploit 
it, but relatively simple models seem to fail. 



5. Lossy compression of TPC signals 

5.1. Fluctuation and accuracy of the 
amplitude measurement 

The number of primary ionization electrons produced by the charged particle in the gas is a random variable described by a Poisson distribution with the mean value of 14.35 cm$^{-1}$ for a minimum ionizing particle in the gas of the ALICE TPC. The secondary electron production (described by an $E^{-2.2}$ probability distribution) increases the number of produced electrons. The most probable value is 25 cm$^{-1}$ total electrons. This effect also smears the probability distribution into the relatively broader Landau distribution.

Due to the angular effect and diffusion, electrons are distributed among several time-bins and pads. The number of electrons which contribute to a given pad and time-bin is described roughly by a Poisson distribution. Each of the registered electrons is subject to gas multiplication, which is described by an exponential probability distribution. In addition, electronic noise is superimposed on each signal.

If we fix the track position and the number of primary electrons, the remaining sample uncertainties can be estimated in the first approximation as:

$$\sigma_A = \sqrt{\sigma_{noise}^2 + G \times A} \qquad (5)$$

where $A$ is the sample amplitude, $\sigma_{noise}$ is given by electronic noise and sampling imprecision, and $G$ is the gain conversion factor.

In reality the situation is more complicated: data samples are correlated through the time response function and the pad response function. The relative correlation between the samples depends on the ratio of the width of the response functions to the width given by the stochastic processes.

5.2. Dynamic precision of the digitization 

In the following study, dynamic precision of the sample quantization was investigated. The quantization step was chosen to correspond to the sample deviation, modifying formula (5) to:

$$\delta_A = \sqrt{K_{off}^2 + K_{cor}^2 \times A} \qquad (6)$$

where the factors $K_{off}$ and $K_{cor}$ were chosen as free parameters. $K_{off}$ is proportional to the electronic noise and $K_{cor}$ is given by the statistics of the stochastic processes and by correlations. Different combinations of these factors were investigated.

In table III the influence of different quantization on the precision of the cluster characteristic determination is shown.

            no compression   K_off = 1      K_off = 1.5    K_off = 1
                             K_cor = 1      K_cor = 1.5    K_cor = 2
range       0..1024          0..62          0..42          0..33
entropy     5.7              3.89 (3.39)    3.34 (2.84)    2.92 (2.45)
σ_P         1.000            1.000          1.006          1.030
σ_T         1.000            1.005          1.015          1.04
σ_PRF       0.069            0.070          0.071          0.074
σ_TRF       0.079            0.079          0.081          0.083
Gain        4.61±0.69        4.63±0.70      4.64±0.71      4.66±0.72

Table III The influence of the lossy compression with different lossy parameters on the cluster characteristics. Row 1 shows the effective range mapping. Row 2 shows the entropy of the samples; numbers in parentheses represent the effective entropy of the data samples when different code tables are used for different sample positions in the bunch. Rows 3 and 4 (σ_P and σ_T) show the influence of the lossy compression on the cluster space resolution in the pad and time directions, respectively. Rows 5 and 6 (σ_PRF and σ_TRF) show the relative influence of the compression on the shape of the cluster in the pad and time directions. The Gain row shows the reconstructed ratio between the total deposited energy and the number of electrons contributing to the cluster.
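One way to realize a quantization step that follows formula (6) is a companding map: the amplitude is transformed by the integral of $1/\delta_A$, rounded to an integer level, and transformed back on decompression. The Python sketch below shows this idea. It is an assumption about how such a quantizer could be built, not the mapping used for the results in this paper, and the function names are hypothetical.

```python
import math

def quantize(amplitude, k_off, k_cor):
    """Level index u(A) = integral from 0 to A of dA' / sqrt(k_off^2 + k_cor^2 * A')."""
    u = 2.0 / k_cor**2 * (math.sqrt(k_off**2 + k_cor**2 * amplitude) - k_off)
    return round(u)

def dequantize(level, k_off, k_cor):
    """Inverse of the companding map, evaluated at the integer level."""
    root = level * k_cor**2 / 2.0 + k_off
    return (root**2 - k_off**2) / k_cor**2

k_off, k_cor = 1.0, 1.0
top_level = quantize(1023, k_off, k_cor)
worst = max(abs(dequantize(quantize(a, k_off, k_cor), k_off, k_cor) - a) / math.sqrt(k_off**2 + k_cor**2 * a)
            for a in range(1024))
print(f"levels 0..{top_level}, worst error = {worst:.2f} quantization steps")
```

For $K_{off} = K_{cor} = 1$ this map sends the 10-bit range to levels 0..62, which is the range quoted in table III. For the other parameter sets it gives ranges that are close to, but not identical with, the tabulated ones, so the original mapping presumably differs in detail.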

The gain factor $G = A_t/N_{el}$ ($A_t$ is the total charge in the cluster, $N_{el}$ is the number of electrons contributing to the cluster) measures the precision of the local deposited energy determination. This factor is important for the dE/dx measurement and consequently for particle identification (PID). The influence of the compression on the cluster position determination varies between 0 and 4%, depending on the compression factor, as can be expected. The shape of the cluster ($\sigma_{PRF}$ and $\sigma_{TRF}$), important for cluster quality determination, varies between 1 and 6%.

In table IV the influence of the compression on the tracking is shown. The reported distortions in $p_t$ and angular resolution are slightly smaller than in the case of the cluster position (0% up to 3.5%). This is due to the other stochastic processes which contribute to the track parameters, e.g. multiple scattering. For high-momentum particles, where the influence of multiple scattering is less important, the expected distortion will be determined by the cluster position distortions.

                        no compression   K_off = 1       K_off = 1.5     K_off = 1
                                         K_cor = 1       K_cor = 1.5     K_cor = 2
σ_φ [mrad]              1.399±0.030      1.378±0.030     1.406±0.030     1.403±0.03
σ_θ [mrad]              0.997±0.018      0.992±0.018     1.002±0.018     0.989±0.018
(row label missing)     0.881±0.011      0.885±0.011     0.886±0.011     0.905±0.011
(row label missing)     2.96±0.11        2.98±0.11       3.06±0.11       3.20±0.11

Table IV The influence of the lossy compression with different lossy parameters on the track characteristics.

CHEP03, La Jolla, California, March 24-28, 2003

With the reduced number of possible sample values, vector quantization of bunches was also investigated. An additional reduction of about 6% was achieved on top of the results reported in table III.



6. Conclusions

[...] physical quantities, particle momenta and dE/dx, is minimal. This approach achieves compression rates in the range from 35% down to 30%, depending on the desired precision. In this study we have focused on methods which are easy to implement in the front-end TPC electronics.
In this paper, a lossless and a quasi-lossless algorithm for the online compression of the data generated by the Time Projection Chamber (TPC) detector of the ALICE experiment at CERN are described.
The first algorithm is based on a lossless source code modelling technique, i.e. the original TPC signal information can be reconstructed without errors at the decompression stage. The source model exploits the temporal correlation that is present in the TPC data to reduce the entropy of the source.

The second algorithm is based on a lossy source code modelling technique, i.e. it is lossy if samples of the TPC signal are considered one by one. Nevertheless, the source model is quasi-lossless from the point of view of some physical quantities that are of main interest for the experiment. These quantities are the shape, the location of the center of gravity and the total charge of the signal.

In order to evaluate the consequences of the error introduced by the lossy compression, the results of the trajectory tracking algorithms that process data offline are analyzed, in particular with respect to the noise introduced by the compression. The offline analysis has two steps: cluster finder and track finder. The results on how these algorithms are affected by the lossy compression are reported.

In both compression techniques, entropy coding is applied to the set of events defined by the source model to reduce the bit rate to the corresponding source entropy. Using simulated TPC data, the lossless compression achieves a data reduction to 49.2% of the original data rate, and the lossy compression a reduction to the range of 35% down to 30%, depending on the desired precision.

In this study we have focused on methods which are easy to implement in the frontend TPC electronics.



1. Introduction 

ALICE (A Large Ion Collider Experiment) is an 
experiment that will start in 2007 at the LHC (Large 
Hadron Collider) at CERN [1, 2]. The experiment 
will study collisions between heavy ions with energies 
around 5.5 TeV per nucleon. The collisions will take 
place at the center of a set of several detectors, which 
are designed to track and identify the produced par- 
ticles. 

One of the main detectors of the ALICE experiment 
is the Time Projection Chamber (TPC). Its task is 
track finding, momentum measurement and particle 
identification by dE/dx. Good two-track resolution, 
required for correlation studies, is one of the main 
design goals. 

The TPC is a large horizontal cylinder filled with gas, in which a suitable axial electric field is present. When particles pass through, they ionize the gas atoms, and the resulting electrons drift in the electric field. By measuring the arrival of electrons at the end of the chamber, the TPC can reconstruct the path of the original charged particles. The electrons are collected by more than 570 000 sensitive pads, where they create signals. These signals are amplified by a preamplifier-shaper and digitized by a 10-bit A/D converter at a sampling frequency of 5.66 MHz. The digitized signal is processed and formatted by an Application Specific Integrated Circuit (ASIC) called ALTRO (ALICE TPC Read-Out) [3]. At this stage, the overall throughput of the 570 000 channels is around 8.4 GByte/s.

The total amount of TPC data is expected to be about 1 PByte per year. In order to keep the complexity and cost of the data storage equipment as low as possible, we have to reduce the volume of data using suitable data compression methods. The cost reduction of the data storage system is roughly proportional to the data compression factor. Furthermore, it is preferable to implement the compression system in the front-end electronics at the output of the ALTRO circuit, so that the cost of the optical links, which carry data out of the chamber to the following stages of the acquisition chain, can also be reduced. More sophisticated methods for TPC data compression based on online tracking, which will be used further along the data acquisition chain, are being developed in Bergen and Heidelberg [4].

The use of a lossy source model, justified by the fact 
that generally it can provide significantly higher com- 
pression ratios compared to lossless models, has the 
drawback that some deterioration in the reconstruc- 
tion of data must be accepted. Lossy source models 
have become very popular in the last decade in the 
field of audio and video compression for their remark- 
able performance. Lossy models have been carefully 
designed so that reconstruction distortions are not 
perceived using psychovisual or psychoacoustic mod- 
els or they remain comparable with the intrinsic signal 
noise. 
Figure 1: Schematic view of the detection process in TPC (upper part - perspective view, lower part - side view).



Obviously, for physical data, psychovisual or psy- 
choacoustic tests are meaningless or even not applica- 
ble since the TPC signal is not to be observed by the 
human eye or ear. In [5], the compression noise intro- 
duced on the sample values by the described lossy or 
quasi lossless techniques has been evaluated in terms 
of RMS of the introduced Error (RMSE). 

However, in this case the RMSE, despite being a simple and well-known distortion measure, is not very useful. The fundamental information that has to be extracted from TPC data is not the sample values themselves but the physical quantities that enable the reconstruction of particle trajectories. Therefore, the correct way to evaluate the importance of the distortion introduced by the compression-decompression process has to be related to the high-level information that is carried by the data. In particular, TPC data are collected with the objective of measuring particle energy and trajectory.



Therefore, the most effective way to estimate the consequences of the compression distortion error is to observe how the extraction of energy and trajectories is affected by the compression-decompression process. A simple way to obtain these estimates is to apply the cluster finding and tracking algorithms to both the simulated data and their compressed-decompressed version and to compare the results.
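The difference between a sample-level measure and a cluster-level measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bunch amplitudes are invented, and `coarse_quantize` is a hypothetical toy codec (uniform rounding), not the lossy scheme studied in this paper.

```python
import math

def rmse(original, decoded):
    """Root-mean-square error between two equally long sample lists."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, decoded)) / n)

def total_charge(samples):
    """Sum of the amplitudes of one bunch."""
    return sum(samples)

def centre_of_gravity(samples):
    """Amplitude-weighted mean time-bin index of one bunch."""
    return sum(i * a for i, a in enumerate(samples)) / sum(samples)

def coarse_quantize(samples, step):
    """Toy lossy codec (assumption): round each sample to a multiple of step."""
    return [step * round(a / step) for a in samples]

bunch = [3, 12, 41, 77, 58, 23, 7]     # hypothetical ADC samples of one bunch
decoded = coarse_quantize(bunch, 4)

print("RMSE per sample  :", rmse(bunch, decoded))
print("charge shift     :", total_charge(decoded) - total_charge(bunch))
print("COG shift [bins] :", centre_of_gravity(decoded) - centre_of_gravity(bunch))
```

In this toy case the individual samples change by about one ADC count each, while the centre of gravity moves by well under a hundredth of a time-bin, which is the kind of distinction the text argues for.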

This article is arranged in the following way. In section 2, all stochastic processes relevant for particle detection in the ALICE TPC are briefly described. In section 3, the TPC data format is specified. In section 4, different lossless compression techniques are described and their efficiencies are compared. In section 5, a fast one-dimensional lossy compression technique is presented and the impact of compression-decompression on the distortion of the most important physical quantities is demonstrated.
2. Stochastic processes in TPC

2.1. Ionization in gas

A charged particle that traverses the gas of the chamber leaves a track of ionization along its trajectory. The collisions with the gas atoms are purely random. They are characterized by a mean free path $\lambda$ between ionizing encounters, which is given by the ionization cross-section per electron $\sigma_{\mathrm{ion}}$ and the density $N$ of electrons:

$$\lambda = \frac{1}{N \sigma_{\mathrm{ion}}}.$$

Therefore, the number of encounters along the length $L$ has the mean $L/\lambda$, and the frequency distribution is given by the Poisson distribution

$$P(L/\lambda, k) = \frac{(L/\lambda)^k}{k!} \exp(-L/\lambda).$$

The mean free path $\lambda$ is given by the properties of the gas and by the charged particle characteristics:

$$\lambda = \frac{1}{N_{\mathrm{prim}} f(\beta\gamma)},$$

where $N_{\mathrm{prim}}$ is the number of primary electrons per cm produced by a Minimum Ionizing Particle (MIP), and $f(\beta\gamma)$ is the Bethe-Bloch curve.
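The link between an exponentially distributed free path and a Poisson-distributed number of encounters can be checked with a short simulation. The sketch below is illustrative only: it uses the MIP value of 14.35 primary electrons per cm quoted in section 5.1, sets the Bethe-Bloch factor to 1, and assumes a hypothetical track length of 0.75 cm.

```python
import math
import random

def mean_free_path(n_prim_mip, bethe_bloch_factor=1.0):
    """lambda = 1 / (N_prim * f(beta*gamma)), in cm."""
    return 1.0 / (n_prim_mip * bethe_bloch_factor)

def poisson_pmf(mean, k):
    """P(mean, k) = mean^k / k! * exp(-mean)."""
    return mean ** k / math.factorial(k) * math.exp(-mean)

def count_collisions(length, lam, rng):
    """Count ionizing encounters along length by drawing exponential free paths."""
    position = rng.expovariate(1.0 / lam)
    count = 0
    while position < length:
        count += 1
        position += rng.expovariate(1.0 / lam)
    return count

rng = random.Random(1)
lam = mean_free_path(14.35)      # MIP value quoted in section 5.1
length = 0.75                    # hypothetical track length in cm (assumption)
counts = [count_collisions(length, lam, rng) for _ in range(20000)]
simulated_mean = sum(counts) / len(counts)

print("expected mean L/lambda:", length / lam)
print("simulated mean        :", simulated_mean)
print("P(k = 10)             :", poisson_pmf(length / lam, 10))
```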



2.2. Generation of secondary electrons 

The energy loss $E_{\mathrm{tot}}$ released in primary ionization to atomic electrons is a random variable. It can be described by the Photo-Absorption Ionization (PAI) model. In most cases, if one neglects the atomic shell structure, at sufficiently high $E_{\mathrm{tot}}$ (the energy where the atomic shell structure is no longer important) it obeys a $1/E_{\mathrm{tot}}^2$ rule.

If the electron produced by the charged particle has sufficient kinetic energy $E_{\mathrm{tot}}$, it will produce secondary electrons, thus creating an electron cluster. The mean total number of electrons in such a cluster is given by:

$$N_{\mathrm{tot}} = \frac{E_{\mathrm{tot}} - I_{\mathrm{pot}}}{W_{\mathrm{ion}}} + 1,$$

where $E_{\mathrm{tot}}$ is the energy loss in a primary collision, $W_{\mathrm{ion}}$ is the effective energy required to produce an electron-ion pair and $I_{\mathrm{pot}}$ is the first ionization potential. The random character of the secondary ionization process smears out structures in the $E_{\mathrm{tot}}$ spectra, so that the atomic shell structure behavior is suppressed. For example, in the gas mixture 90% Ne, 10% CO$_2$ the effective parametrization $E_{\mathrm{tot}}^{-2.2}$ can be used at lower $E_{\mathrm{tot}}$.



2.3. Diffusion of electrons

The produced electrons drift through the gas with an effective constant drift velocity in the direction given by the electric field E and the magnetic field B (which we assume are parallel to the z-direction). Drifting electrons are scattered on the gas molecules, so that their direction of motion is randomized in each collision. The position of the electron after drifting over a distance $L_{\mathrm{drift}}$ can be described by a 3-D Gaussian distribution:

$$P(x,y,z) = \frac{1}{\sqrt{2\pi}\,\sigma_T} \exp\left[-\frac{(x - x_0)^2}{2\sigma_T^2}\right] \times \frac{1}{\sqrt{2\pi}\,\sigma_T} \exp\left[-\frac{(y - y_0)^2}{2\sigma_T^2}\right] \times \frac{1}{\sqrt{2\pi}\,\sigma_L} \exp\left[-\frac{(z - L_{\mathrm{drift}})^2}{2\sigma_L^2}\right] \qquad (1)$$

where $\{x_0, y_0, z_0\}$ is the electron creation point, and the transversal diffusion $\sigma_T$ and the longitudinal diffusion $\sigma_L$ are given by the drift length $L_{\mathrm{drift}}$ and the gas coefficients $D_T$ and $D_L$:

$$\sigma_T = D_T \sqrt{L_{\mathrm{drift}}}, \qquad \sigma_L = D_L \sqrt{L_{\mathrm{drift}}}.$$
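As a rough numerical illustration of the square-root scaling of the diffusion widths, the following sketch draws electron positions from independent Gaussians in x, y and z. The diffusion coefficients and the drift length are hypothetical placeholder values chosen for the example, not numbers taken from this paper, and the z coordinate is assumed to be centred on the drift distance.

```python
import math
import random

def diffusion_sigmas(l_drift, d_t, d_l):
    """sigma_T = D_T * sqrt(L_drift), sigma_L = D_L * sqrt(L_drift)."""
    root = math.sqrt(l_drift)
    return d_t * root, d_l * root

def drift_electron(x0, y0, l_drift, d_t, d_l, rng):
    """Draw one electron position after drifting over l_drift (3-D Gaussian)."""
    sigma_t, sigma_l = diffusion_sigmas(l_drift, d_t, d_l)
    return (rng.gauss(x0, sigma_t),
            rng.gauss(y0, sigma_t),
            rng.gauss(l_drift, sigma_l))

D_T = D_L = 0.022        # hypothetical coefficients in cm / sqrt(cm)
L_DRIFT = 100.0          # hypothetical drift length in cm
rng = random.Random(7)

points = [drift_electron(0.0, 0.0, L_DRIFT, D_T, D_L, rng) for _ in range(10000)]
xs = [p[0] for p in points]
mean_x = sum(xs) / len(xs)
rms_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / len(xs))

print("expected sigma_T [cm]:", diffusion_sigmas(L_DRIFT, D_T, D_L)[0])
print("sampled rms in x [cm]:", rms_x)
```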

2.4. ExB and unisochronity effect near 
the anode wires 

It has been assumed that the electric and magnetic fields in the drift volume are uniform and parallel. This, however, is not true close to the anode wires, where the electric field becomes radial. Thus the electrons experience a shift along the wire direction (due to the Lorentz force). If an electron enters the readout chamber at the point $(x_e, y_e)$, it is displaced in the x-direction (assuming that the wires are placed along the y-axis). The new y-position of the electron is then given by

$$y = y_e + \omega\tau \cdot (x - x_e),$$

where $x$ is the coordinate of the wire on which the electron is collected, and $\omega\tau$ is the tangent of the Lorentz angle (ExB effect). The drift length, which determines the z coordinate, will also be affected because of the change in the path to the anode wire (unisochronity effect).

2.5. Signal generation 

Inside the readout chamber, as an electron drifts towards the anode wire, it travels in an increasing electric field. Once the electric field is strong enough that the electron can pick up sufficient energy for ionization between collisions with the gas molecules, another electron is created and the avalanche starts. As the number of electrons multiplies in successive generations, the avalanche continues to grow until all the electrons are collected on the wire. The resulting number of electrons created in the avalanche can be described by an exponential probability distribution

$$P(q) = \frac{1}{\bar{q}} \exp\left(-\frac{q}{\bar{q}}\right),$$

where $\bar{q}$ is the average avalanche amplitude.

An electron avalanche collected on the anode wire induces a charge on the pad plane. This charge is integrated over the pad area. The time signal is obtained by folding the pad response to the avalanche with the shaping function of the preamplifier-shaper. This signal is then sampled with a constant frequency. Random electronic noise is superimposed on the sampled signal.

As a result a charged particle interacting with gas 
generates a cluster of amplitudes. This cluster is used 
for later estimation of local track position and of local 
energy deposition. The shape of the cluster is used 
as additional information for the estimation of posi- 
tion uncertainties and for the estimation of the overlap 
factor between two tracks. 



2.6. Accuracy of local coordinate measurement

The accuracy of the coordinate measurement is limited by the track angle, which spreads the ionization, and by diffusion, which amplifies this spread.

The track direction with respect to the pad plane is given by two angles $\alpha$ and $\beta$ (see fig. 1). For the measurement along the pad-row, the angle $\alpha$ between the track projected onto the pad plane and the pad-row is relevant. For the measurement of the drift coordinate (z-direction) it is the angle $\beta$ between the track and the z axis.

The ionization electrons are randomly distributed along the particle trajectory. Fixing the reference x position of an electron at the middle of the pad-row, the y (resp. z) position of the electron is a random variable characterized by a uniform distribution with the width $L_\alpha$, where $L_\alpha$ is given by the pad length $L_{\mathrm{pad}}$ and the angle $\alpha$ (resp. $\beta$):

$$L_\alpha = L_{\mathrm{pad}} \tan\alpha$$

The diffusion smears out the position of the electron with a Gaussian probability distribution with $\sigma_D$. The contribution of the ExB and unisochronity effects is negligible in the case of the ALICE TPC.

The accuracy of the position measurement can be expressed as:

$\sigma_z$ of cluster center in z (time) direction:

$$\sigma^2_{z_{\mathrm{COG}}} = \frac{D_L^2 L_{\mathrm{drift}}}{N_{\mathrm{ch}}} G_g + \frac{\tan^2\beta \, L_{\mathrm{pad}}^2 \, G_{\mathrm{Lfactor}}(N_{\mathrm{chprim}})}{12 N_{\mathrm{chprim}}} + \sigma^2_{\mathrm{noise}} \qquad (2)$$

and $\sigma_y$ of cluster center in y (pad) direction:

$$\sigma^2_{y_{\mathrm{COG}}} = \frac{D_T^2 L_{\mathrm{drift}}}{N_{\mathrm{ch}}} G_g + \frac{\tan^2\alpha \, L_{\mathrm{pad}}^2 \, G_{\mathrm{Lfactor}}(N_{\mathrm{chprim}})}{12 N_{\mathrm{chprim}}} + \sigma^2_{\mathrm{noise}} \qquad (3)$$

where $N_{\mathrm{ch}}$ is the total number of electrons in the cluster, $N_{\mathrm{chprim}}$ is the number of primary electrons in the cluster, $G_g$ is the gas gain fluctuation factor parametrization, $G_{\mathrm{Lfactor}}$ is the secondary ionization fluctuation factor and $\sigma_{\mathrm{noise}}$ describes the contribution of the electronic noise and ADC quantization to the resulting sigma of the COG.

The typical resolution in the case of the ALICE TPC is on the level of $\sigma_y \sim 0.8$ mm and $\sigma_z \sim 1.0$ mm, integrating over all clusters in the TPC.

2.7. Accuracy of the total amplitude measurement

The total charge deposited in the clusters can be used for particle identification. The important value, which is specific for different particle types and different particle momenta, is the number of primary collisions per unit length, $N_{\mathrm{chprim}}$. $N_{\mathrm{chprim}}$ is a random variable described by a Poisson distribution. Due to the secondary ionization and gas gain fluctuations, the total charge is described by a very broad Landau distribution.



3. The ALICE TPC read-out data format 



Before describing the compression algorithms, it is necessary to say a few words about the format of the data at the output of the ALTRO circuit, in order to understand how the compression algorithms are applied. These data are the input of the compression system [1, 3].

In the ALTRO data format only the samples over a given threshold are considered, while the others are discarded. This means that, if we call a bunch a group of adjacent over-threshold samples coming from one pad, the signal can be represented "bunch by bunch". More precisely, a bunch is described by three fields: temporal information (the temporal position of the last sample in the bunch, one 10-bit word), bunch length (i.e. the number of samples in the bunch, one 10-bit word), and sample amplitude values (a few 10-bit words).
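The bunch representation described above can be sketched as follows. The function names, the threshold value and the sample list are assumptions made for illustration; the actual ALTRO word packing is not reproduced here, only the three logical fields and their 10-bit word count.

```python
def find_bunches(samples, threshold):
    """Return (last_time_bin, length, amplitudes) for each run of
    adjacent samples above threshold in one pad signal."""
    bunches, current = [], []
    for t, amplitude in enumerate(samples):
        if amplitude > threshold:
            current.append(amplitude)
        elif current:
            bunches.append((t - 1, len(current), current))
            current = []
    if current:  # signal ends inside a bunch
        bunches.append((len(samples) - 1, len(current), current))
    return bunches

def size_in_bits(bunches, word=10):
    """One word for the time, one for the length, one per sample."""
    return sum(word * (2 + length) for _, length, _ in bunches)

pad_signal = [0, 1, 5, 9, 4, 0, 0, 3, 0, 7, 8]   # hypothetical ADC values
bunches = find_bunches(pad_signal, threshold=2)

for last_bin, length, amplitudes in bunches:
    print(last_bin, length, amplitudes)
print("size [bits]:", size_in_bits(bunches))
```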
4. Lossless compression of TPC signals

The lossless data compression techniques are based on the fact that TPC sample values (ADC and temporal) are not equally probable. A theoretical lower limit on the average word size achievable with a lossless technique such as Huffman coding or arithmetic coding is given by the entropy of the data source:

$$E(p) = -\sum_{A} p(A) \log_2 p(A) \qquad (4)$$

The lossless techniques described in this paper are 
based mainly on an appropriate probability model for 
each data field of the ALTRO data format. Specific 
probability models for each sample in a bunch were 
developed. These models intend to capture both tem- 
poral correlation among samples and the characteris- 
tic shape of TPC electrical pulses. 
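The entropy bound of formula (4) can be estimated from observed frequencies. The short sketch below is only an illustration; the symbol list is invented, and the probabilities are the empirical frequencies of that list.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol: -sum p(A) * log2 p(A)."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

skewed = [1, 1, 1, 1, 2, 2, 3, 4]      # hypothetical, unequal probabilities
uniform = [1, 2, 3, 4, 5, 6, 7, 8]     # hypothetical, equal probabilities

print("skewed source :", entropy_bits(skewed), "bits per symbol")
print("uniform source:", entropy_bits(uniform), "bits per symbol")
```

The less uniform the distribution, the lower the bound, which is why the models in the following subsections try to make the coded events as unevenly distributed as possible.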

4.1. Time information 

As already mentioned, in the ALTRO data format the time information is represented as the 10-bit number of the time-bin of the last sample of the bunch. The probability distribution of this variable is roughly uniform. In order to achieve a better compression ratio, this variable is replaced by the distance between two consecutive bunches. This variable follows an exponential distribution with much lower entropy. The entropy of the temporal information is determined by the mean distance between two bunches. It depends on the event multiplicity, the noise level and the local occupancy, which is a known function of the pad-row radius. In order to optimize the entropy coding, it will be necessary to investigate the probability distribution as a function of track multiplicity. This information will be known from other, faster ALICE detectors.

The mean number of bits used for the coding of the time information is roughly 4.9 bits for the full event with maximal track density. Using different codes in different places inside the TPC, an additional 6% reduction in time information can be achieved.
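The substitution of absolute time-bins by distances between consecutive bunches can be sketched as below. The time-bin list is invented, and the sketch assumes that the bunches of one pad are processed in order of increasing time; the ordering used by the real read-out may differ.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol."""
    total = len(symbols)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(symbols).values())

def to_distances(time_bins):
    """Replace each time-bin by its distance to the previous bunch
    (the first bunch is measured from time-bin 0)."""
    previous = [0] + time_bins[:-1]
    return [t - p for p, t in zip(previous, time_bins)]

def from_distances(distances):
    """Inverse transformation, used at the decompression stage."""
    out, t = [], 0
    for d in distances:
        t += d
        out.append(t)
    return out

time_bins = [12, 20, 27, 35, 44, 52, 60, 69]    # hypothetical bunch positions
distances = to_distances(time_bins)

print("distances          :", distances)
print("entropy, absolute  :", entropy_bits(time_bins))
print("entropy, distances :", entropy_bits(distances))
```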

4.2. Bunch length 

In the ALTRO data format, the bunch length is represented as a 10-bit code giving the number of samples in the bunch. The bunch length depends on the diffusion, the angular effect and the total deposited energy. There is no apparent correlation with previously coded data. Small diffusion for short drift lengths is compensated by a large angular effect. The total deposited energy is known only after the coding of the bunch length. Since no apparent correlation with other data (e.g. the length of adjacent bunches) exists and no better model (i.e. a model of events with lower entropy) could be found, this information is coded directly.

Table I Entropy of the sample data as a function of the sample position in the bunch. Frequency of the sample length is given in arbitrary units.



4.3. Sample values coding 

Sample values are the main contribution to the resulting data volume. This subsection first describes a basic model and then introduces a more sophisticated one that provides higher compression efficiency.

Data compression can be obtained by directly applying entropy coding to the sample values without any modelling of the information source. This method will be referred to below (in table II) as Entropy Coding (EC).



4.4. Coding model based on the sample 
position 

Improvements in compression performance can be obtained by appropriate modelling. A first improvement is achieved by exploiting the fact that the statistics of the signal sample values depend on the position of the sample in the bunch.

Due to the pseudo-Gaussian shape of most of the bunches, the first and the last sample of each bunch are likely to have smaller values than those in central positions. Similarly, small values are also expected for isolated samples, i.e. those belonging to one-sample bunches (see table I).

Therefore, a classification of the samples into three classes was chosen: one class for isolated samples, one for samples at the beginning and at the end of multiple-sample bunches, and the last for samples in the central positions of a bunch. Using three different probability distributions for entropy coding, the sample values can be coded more efficiently than with only one probability distribution. This coding scheme will be referred to in table II as coding using Sample Position (SP).
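The three-class scheme can be illustrated by comparing the entropy of all samples pooled together with the weighted entropy obtained when each class has its own probability table. The bunches below are invented and the class names are labels chosen for this sketch.

```python
import math
from collections import Counter, defaultdict

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol."""
    total = len(symbols)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(symbols).values())

def sample_class(index, length):
    """Class of a sample given its position in a bunch of given length."""
    if length == 1:
        return "isolated"
    if index == 0 or index == length - 1:
        return "edge"
    return "central"

def sp_entropy(bunches):
    """Average bits per sample with one probability table per class."""
    by_class = defaultdict(list)
    for bunch in bunches:
        for i, amplitude in enumerate(bunch):
            by_class[sample_class(i, len(bunch))].append(amplitude)
    total = sum(len(v) for v in by_class.values())
    return sum(len(v) / total * entropy_bits(v) for v in by_class.values())

bunches = [[3], [4, 20, 5], [3, 18, 22, 4], [5, 21, 3], [4]]   # hypothetical
pooled = [a for bunch in bunches for a in bunch]

print("single table (EC-like):", entropy_bits(pooled))
print("three tables (SP-like):", sp_entropy(bunches))
```

For empirical frequencies the class-conditioned entropy can never exceed the pooled one; how large the gain is depends on how different the three distributions are.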
Table II Performance of several lossless techniques compared to the zero-suppressed ALTRO data format. ALTRO: original ALTRO data; EC: entropy coding of sample values, bunch length, and time information; SP: classification of samples according to their position (3 code tables used); TC: coding technique that exploits temporal correlation (20 code tables used). Numbers in the columns represent the number of bits per bunch dedicated to each field; numbers in the right column represent the overall number of bits per bunch and, in parentheses, the size with respect to the original ALTRO data format.



4.5. Source models exploiting temporal 
correlation 

An improvement in compression performance can be expected by exploiting temporal correlation, i.e. the correlation between consecutive samples; this can be done by implementing a suitable prediction scheme.

This approach is explained by an example in which a three-sample bunch is considered. Let us assume that the first two samples have already been coded and that the third one has to be coded. The code to be used for sample No. 3 may be chosen among eight possible codes according to the value of sample No. 2. In particular, this is done by subdividing the range of sample No. 2 (i.e. 0...1023) into different intervals and associating a different code (for the third sample) with each of these intervals.

This conditioned probability model can be extended to all the samples that are not in the first position in the bunch, and to any bunch length. However, if the real-time implementation constraints are taken into account, in particular the need to reduce the memory size of the model, an exceedingly large number of codes is undesirable. Consequently, samples are partitioned into four classes only, to keep the complexity of the model low. This limitation does reduce the efficiency of the model, but the reduction is only of the order of 0.6%. This coding scheme will be referred to in table II as coding using Temporal Correlation (TC).
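The selection of a code table from the value of the previous sample can be sketched as follows. The interval boundaries and the bunches are hypothetical, chosen only to give four intervals of the range 0...1023; the boundaries actually used in the study are not stated here.

```python
import bisect
import math
from collections import Counter, defaultdict

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol."""
    total = len(symbols)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(symbols).values())

def context_index(previous_sample, boundaries):
    """Index of the interval of 0...1023 containing the previous sample."""
    return bisect.bisect_right(boundaries, previous_sample)

def tc_entropy(bunches, boundaries):
    """Average bits per non-first sample with one table per interval."""
    by_context = defaultdict(list)
    for bunch in bunches:
        for previous, current in zip(bunch, bunch[1:]):
            by_context[context_index(previous, boundaries)].append(current)
    total = sum(len(v) for v in by_context.values())
    return sum(len(v) / total * entropy_bits(v) for v in by_context.values())

BOUNDARIES = [8, 32, 128]      # hypothetical: intervals 0-7, 8-31, 32-127, 128-1023
bunches = [[4, 20, 5], [3, 18, 22, 4], [5, 21, 3], [6, 40, 150, 90, 7]]
followers = [a for bunch in bunches for a in bunch[1:]]

print("one table for all followers:", entropy_bits(followers))
print("one table per interval     :", tc_entropy(bunches, BOUNDARIES))
```

The first sample of each bunch has no predecessor and would need its own table; it is left out of this sketch.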

4.6. Comparison of different lossless techniques

The results of the different lossless approaches on simulated TPC data are shown in table II. It may be noticed that the last technique, TC, compresses the data down to 49.2% of the original size. Even this best technique improves the reduction factor by only 3% compared with the direct EC technique.

An additional attempt used predicted mean cluster shape information. Knowing the position of the bunch, the diffusion given by the drift length ($L_{\mathrm{drift}}$) and the inclination angle for primary particles are known. However, due to the fluctuation of the cluster shape and the large number of secondary particles with unknown angles, this prediction is not very good, and the entropy of the samples is reduced only by an additional 2%.

4.7. Space correlation 

In the attempt to exploit space correlation, three lossless models have been considered. The first is based on spatially conditioned probability, the second on a predictive model, and the third on a two-dimensional cluster finder with storage of the residuals.

The first one is the equivalent, in the spatial do- 
main, to what has been done for time correlation. Dif- 
ferent codes are available to code the samples; for each 
sample, the appropriate code is selected according to 
the value of the samples in the same time-bin but 
in adjacent pads. This method provides poorer per- 
formance when compared with the one which exploits 
time correlation (the comparison being done using the 
same model complexity, i.e. number of probability dis- 
tributions available in memory). Moreover, these two 
techniques cannot be easily combined, i.e. it is diffi- 
cult to exploit both temporal and spatial correlations 
at the same time, because this would require a very 
large number of probability distributions (i.e. code 
tables). 

The second method that has been investigated predicts the sample values from the samples in adjacent pads and codes the error of this prediction. Unfortunately, the performance of this model is also not very good.

Pulses in one pad-row often resemble temporally shifted versions of those in the adjacent pad-row. The two methods described above have been modified by adding a first stage which shifts pulses so as to increase the spatial correlation with the adjacent ones. Although the performance has slightly improved, the increase of the compression efficiency was lower than expected. The correlations are relatively small. The main problems here are the large number of secondary particles crossing the TPC with unknown $\beta$ angle (not pointing to the primary vertex), the large spread of the particle momenta (unknown $\alpha$ angle) and the Landau fluctuation of the deposited energy on different pad-rows, which is almost uncorrelated. Moreover, the position of the original track relative to the pad affects the correlation by a large factor. The signal amplitudes in adjacent pads and adjacent pad-rows are very weakly correlated unless the position and direction of the track are known.

In order to get better knowledge of the track position, two-dimensional cluster finding can be performed first. The entropy of the stored residuals is 30% lower than the entropy of the original samples, but there are problems with track overlaps and with the description of the cluster topology (i.e. where to store the residuals).

Based on these results we conclude that it is not 
simple to exploit spatial correlation (i.e. correlations 
between adjacent channels). There might be more so- 
phisticated and complex lossless models able to exploit 
it, but relatively simple models seem to fail. 



5. Lossy compression of TPC signals 

5.1. Fluctuation and accuracy of the 
amplitude measurement 

The number of primary ionization electrons produced by the charged particle in the gas is a random variable described by a Poisson distribution with the mean value 14.35 cm$^{-1}$ for a minimum ionizing particle in the gas of the ALICE TPC. The secondary electron production (described by an $E^{-2.2}$ probability distribution) increases the number of produced electrons. The most probable value is 25 cm$^{-1}$ of total electrons. This effect also smears the probability distribution into the relatively broader Landau distribution.

Due to the angular effect and diffusion, electrons are distributed among several time-bins and pads. The number of electrons which contribute to a given pad and time-bin is described roughly by a Poisson distribution. Each of the registered electrons is subject to gas multiplication, which is described by an exponential probability distribution. On top of this, additional electronic noise is superimposed on each signal.

If we fix the track position and the number of primary electrons, the remaining sample uncertainties can be estimated in the first approximation as:

$$\sigma_A^2 = \sigma_{\mathrm{noise}}^2 + G \times A \qquad (5)$$

where $\sigma_A$ is the uncertainty of a sample with amplitude $A$, $\sigma_{\mathrm{noise}}$ is given by the electronic noise and sampling imprecision, and $G$ is the gain conversion factor.

The situation is more complicated because the data samples are correlated through the time response function and the pad response function. The relative correlation between the samples depends on the ratio of the width of the response functions to the width given by the stochastic processes.

5.2. Dynamic precision of the digitization 

In the following study, dynamic precision of the sample quantization was investigated. The quantization was chosen to correspond to the sample deviation, modifying formula (5) to:

$$\sigma_A^2 = K_{\mathrm{off}}^2 + K_{\mathrm{cor}}^2 \times A \qquad (6)$$

where the factors $K_{\mathrm{off}}$ and $K_{\mathrm{cor}}$ were chosen as free parameters. $K_{\mathrm{off}}$ is proportional to the electronic noise and $K_{\mathrm{cor}}$ is given by the statistics of the stochastic processes and by correlations. Different combinations of these factors were investigated.

In table III the influence of different quantization on the precision of the cluster characteristic determination is shown.

Table III The influence of the lossy compression with different lossy parameters on the cluster characteristics. Row 1 shows the effective range mapping. Row 2 shows the entropy of the samples; numbers in parentheses represent the effective entropy of the data samples when a different code table is used for each sample position in the bunch. Rows 3 and 4 ($\sigma_p$ and $\sigma_t$) show the influence of the lossy compression on the cluster space resolution in the pad and time directions, respectively. Rows 5 and 6 show the relative influence of the compression on the shape of the cluster in the time and pad directions. The Gain row shows the reconstructed ratio between the total deposited energy and the number of electrons contributing to the cluster.
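One possible way to turn an amplitude-dependent sample deviation into a quantizer is sketched below. It is an illustration under explicit assumptions, not the implementation used for table III: the deviation is taken as sqrt(K_off^2 + K_cor^2 * A), the level spacing is set to that deviation rounded to an integer (at least 1 ADC count), and the function names are hypothetical.

```python
import bisect
import math

def sample_sigma(amplitude, k_off, k_cor):
    """Assumed sample deviation: sqrt(K_off^2 + K_cor^2 * A)."""
    return math.sqrt(k_off ** 2 + k_cor ** 2 * amplitude)

def build_levels(k_off, k_cor, max_adc=1023):
    """Reconstruction levels whose spacing follows the assumed deviation."""
    levels = [0]
    while levels[-1] < max_adc:
        step = max(1, round(sample_sigma(levels[-1], k_off, k_cor)))
        levels.append(min(max_adc, levels[-1] + step))
    return levels

def quantize(amplitude, levels):
    """Map a 10-bit amplitude to the nearest reconstruction level."""
    i = bisect.bisect_left(levels, amplitude)
    if i == 0:
        return levels[0]
    if i == len(levels):
        return levels[-1]
    low, high = levels[i - 1], levels[i]
    return low if amplitude - low <= high - amplitude else high

levels = build_levels(k_off=1.0, k_cor=1.0)
print("number of levels:", len(levels), "instead of 1024")
for a in (3, 50, 400, 1000):
    print(a, "->", quantize(a, levels))
```

With such a mapping small amplitudes keep almost full precision, while large amplitudes, whose intrinsic fluctuation is larger, are coded with coarser steps; fewer distinct values mean a lower sample entropy.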

The gain factor $G = A_t / N_{\mathrm{el}}$ ($A_t$ is the total charge in the cluster, $N_{\mathrm{el}}$ is the number of electrons contributing to the cluster) measures the precision of the local deposited energy determination. This factor is important for the dE/dx measurement and consequently for particle identification (PID). The influence of the compression on the cluster position determination varies by up to 4%, depending on the compression factor, as can be expected. The shape of the cluster ($\sigma_{\mathrm{prf}}$ and $\sigma_{\mathrm{trf}}$), important for cluster quality determination, varies between 1% and 6%.

In table IV the influence of the compression on the tracking is shown. The reported distortions in $p_t$ and angular resolution (0% up to 3.5%) are slightly smaller than in the case of the cluster position. This is due to the other stochastic processes which contribute to the track parameters, e.g. multiple scattering. For high-momentum particles, where the influence of multiple scattering is less important, the expected distortion will be determined by the cluster position distortions.

Table IV The influence of the lossy compression with different lossy parameters on the track characteristics.

To reduce the number of possible sample values, vector quantization of bunches was also investigated. An additional reduction of about 6% was achieved on top of the results reported in table III.



6. Conclusions

Several methods of lossless TPC data compression were investigated (sec. 4.6). The best one-dimensional methods compress the data down to 49.2% of the original size.

A lossy compression approach for the data generated by the TPC in the ALICE experiment has also been investigated. The main idea was to preserve, at the level of the intrinsic noise, the three most important local quantities: the cluster position, the deposited energy and the shape of the cluster. Keeping the distortions of the local quantities at a reasonable level, the impact on the most interesting physical quantities, particle momenta and dE/dx, is minimal. This approach achieves compression rates in the range from 35% down to 30%, depending on the desired precision. In this study we have focused on methods which are easy to implement in the frontend TPC electronics.